Abstract

The success of deep architectures is at least in part attributed to the layer-by-layer unsupervised pre-training that
initializes the network. Various papers have reported extensive empirical analysis focusing on the design and
implementation of good pre-training procedures. However, an understanding pertaining to the consistency of parameter
estimates, the convergence of learning procedures and the sample size estimates is still unavailable in the literature.
In this work, we study pre-training in classical and distributed denoising autoencoders with these goals in mind. We
show that the gradient converges at the rate of $1/\sqrt{N}$ and has a sub-linear dependence on the size of the autoencoder
network. In a distributed setting where disjoint sections of the whole network are pre-trained synchronously, we show 
that the convergence improves by at least where r corresponds to the size of the sections. We provide a broad 
set of experiments to empirically evaluate the suggested behavior. 


1 Introduction 


In the last decade, deep learning models have provided state of the art results for a broad spectrum of problems in
computer vision Krizhevsky et al. (2012); Taigman et al. (2014), natural language processing Socher et al. (2011a,b),
machine learning Hamel & Eck (2010); Dahl et al. (2011) and biomedical imaging Plis et al. (2013). The underlying
deep architecture with multiple layers of hidden variables allows for learning high-level representations which
fall beyond the hypothesis space of (shallow) alternatives Bengio (2009). This representation-learning behavior is
attractive in many applications where setting up a suitable feature engineering pipeline that captures the discriminative
content of the data remains difficult, but is critical to the overall performance. Despite many desirable qualities, the
richness afforded by multiple levels of variables and the non-convexity of the learning objectives make training deep
architectures challenging. An interesting solution to this problem proposed in Hinton & Salakhutdinov (2006); Bengio
et al. (2007) is a hybrid two-stage procedure. The first step performs a layer-wise unsupervised learning, referred to as
"pre-training", which provides a suitable initialization of the parameters. With this warm start, the subsequent
discriminative (supervised) step simply fine-tunes the network with an appropriate loss function. Such procedures broadly
fall under two categories: restricted Boltzmann machines and autoencoders Bengio (2009). Extensive empirical evidence
has demonstrated the benefits of this strategy, and the recent success of deep learning is at least partly attributed to
pre-training Bengio (2009); Erhan et al. (2010); Coates et al. (2011).

Given this role of pre-training, there is significant interest in understanding precisely what the unsupervised phase
does and why it works well. Several authors have provided interesting explanations to these questions. Bengio (2009)
interprets pre-training as providing the downstream optimization with a suitable initialization. Erhan et al. (2009,
2010) presented compelling empirical evidence that pre-training serves as an "unusual form of regularization" which
biases the parameter search by minimizing variance. The influence of the network structure (lengths of visible and
hidden layers) and optimization methods on the pre-training estimates has been well studied Coates et al. (2011);
Ngiam et al. (2011). Dahl et al. (2011) evaluate the role of pre-training for DBN-HMMs as a function of sample sizes
and discuss the regimes which yield the maximum improvements in performance. A related but distinct set of results
describe procedures that construct "meaningful" data representations. Denoising autoencoders Vincent et al. (2010)
seek representations that are invariant to data corruption, while contractive autoencoders (CA) Rifai et al. (2011b) seek
robustness to data variations. The manifold tangent classifier Rifai et al. (2011a) searches for a low dimensional
non-linear sub-manifold that approximates the input distribution. Other works have shown that with a suitable architecture,
even a random initialization seems to give impressive performance Saxe et al. (2011). Very recently, Livni et al. (2014);
Bianchini & Scarselli (2014) have analyzed the complexity of multi-layer neural networks, theoretically justifying
that certain types of deep networks learn complex concepts. While the significance of the results above cannot be
overemphasized, our current understanding of the conditions under which pre-training is guaranteed to work well is
still not very mature. Our goal here is to complement the above body of work by deriving specific conditions under
which this pre-training procedure will have convergence guarantees.

To keep the presentation simple, we restrict our attention to a widely used form of pre-training, the denoising
autoencoder, as a sandbox to develop our main ideas, while noting that a similar style of analysis is possible for other
(unsupervised) formulations also. Denoising autoencoders (DA) seek robustness to partial destruction (or corruption)
of the inputs, implying that a good higher level representation must characterize only the 'stable' dependencies among
the data dimensions (features) and remain invariant to small variations Vincent et al. (2010). Since the downstream
layers correspond to increasingly non-linear compositions, the layer-wise unsupervised pre-training with DAs gives
increasingly abstract representations of the data as the depth (number of layers) increases. These non-linear
transformations (e.g., sigmoid functions) make the objective non-convex, and so DAs are typically optimized via stochastic
gradients. Recently, large scale architectures have also been successfully trained in a massively distributed setting
where the stochastic descent is performed asynchronously over a cluster Dean et al. (2012). The empirical evidence
regarding the performance of this scheme is compelling. The analysis in this paper is an attempt to understand this
behavior on the theoretical side (for both classical and distributed DA), and identify situations where such constructions
will work well with certain guarantees.

We summarize the main contributions of this paper. We first derive convergence results and the associated sample
size estimates of pre-training a single layer DA using the randomized stochastic gradients Ghadimi & Lan (2013). We
show that the convergence of expected gradients is $O\left((d_h d_v)^{3/4}/\sqrt{N}\right)$ and the number of calls (to a first
order oracle) is $O\left((d_h d_v)^{3/2}/\epsilon^2\right)$, where $d_h$ and $d_v$ correspond to the number of hidden and visible
units, $N$ is the number of iterations, and $\epsilon$ is an error parameter. We then show that the DA objective can be
distributed and present improved rates while learning small fractions of the network synchronously. These bounds relate
the sample size, the asymptotic convergence of the gradient norm (to zero) and the number of hidden/visible units. Our
results extend easily to stacked and convolutional denoising autoencoders. Finally, we provide sets of experiments to
evaluate if the results are meaningful in practice.


2 Preliminaries 


Autoencoders are single layer neural networks that learn over-complete representations by applying nonlinear
transformations on the input data Vincent et al. (2010); Bengio (2009). Given an input $x$, an autoencoder identifies
representations of the form $h = \sigma(Wx)$, where $W$ is a $d_h \times d_v$ transformation matrix and $\sigma$ denotes the
point-wise sigmoid nonlinearity. Here, $d_v$ and $d_h$ denote the lengths of the visible and hidden layers respectively.
Various types of autoencoders are possible depending on the assumptions that generate the $h$'s: robustness to data
variations/corruptions, enforcing data to lie on some low-dimensional sub-manifolds etc. Rifai et al. (2011b,a).

Denoising autoencoders are a widely used class of autoencoders Vincent et al. (2010) that learn higher-level
representations by leveraging the inherent correlations/dependencies among input dimensions ($j = 1, \ldots, d_v$), thereby
ensuring that $h$ is robust to changes in less informative input/visible units. This is based on the hypothesis that abstract
high-level representations should only encode stable data dependencies across input dimensions, and be robust to
spurious correlations and invariant features. This is done by 'corrupting' each individual visible dimension randomly,
and using the corrupted version ($\tilde{x}$'s) instead, to learn the $h$'s. The corruption generally corresponds to ignoring
(setting to 0) the input signal with some probability (denoted by $\zeta$), although other types of additive/multiplicative
corruption may also be used. If $x_j$ is the input at the $j$-th unit, then the corrupted signal is $\tilde{x}_j = x_j$ with
probability $1 - \zeta$ and $0$ otherwise, where $j = 1, \ldots, d_v$. Note that each of the $d_v$ dimensions is corrupted
independently with the same probability $\zeta$. DA pre-training then corresponds to estimating the transformation $W$ by
minimizing the following objective Bengio (2009),

$$\min_{W} \; \mathbb{E}_{p(x, \tilde{x})} \left\| x - \sigma\left(W^T \sigma(W\tilde{x})\right) \right\|^2 \tag{1}$$

where the expectation is over the joint probability $p(x, \tilde{x})$ of sampling an input $x \sim \mathcal{D}$ and generating the
corresponding $\tilde{x} \mid x$ using $\zeta$. The bias term (which is never corrupted) is taken care of by appending inputs $x$ with 1.
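The corruption process and the reconstruction loss inside (1) can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: masking corruption with probability `zeta`, tied weights stored as a list of $d_h$ rows of length $d_v$, and no bias term. The names `corrupt` and `da_loss` are hypothetical and do not come from the paper.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def corrupt(x, zeta, rng):
    """Set each coordinate of x to 0 independently with probability zeta."""
    return [0.0 if rng.random() < zeta else xj for xj in x]

def da_loss(W, x, x_tilde):
    """Squared reconstruction error ||x - sigma(W^T sigma(W x_tilde))||^2.

    W is a list of d_h rows, each of length d_v (tied encoder/decoder weights).
    """
    # Hidden representation h = sigma(W x_tilde), one entry per row of W.
    h = [sigmoid(sum(w_ij * xj for w_ij, xj in zip(row, x_tilde))) for row in W]
    # Reconstruction sigma(W^T h), one entry per visible unit.
    recon = [sigmoid(sum(W[i][j] * h[i] for i in range(len(W)))) for j in range(len(x))]
    return sum((xj - rj) ** 2 for xj, rj in zip(x, recon))

rng = random.Random(0)
x = [0.2, 0.9, 0.5, 0.1]                                      # d_v = 4, inputs in [0, 1]
W = [[rng.uniform(-0.5, 0.5) for _ in x] for _ in range(3)]   # d_h = 3
x_tilde = corrupt(x, zeta=0.3, rng=rng)
print(x_tilde, da_loss(W, x, x_tilde))
```

Averaging `da_loss` over many draws of `x` and `corrupt(x, ...)` gives a Monte Carlo estimate of the expectation in (1).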

For notational simplicity, let us denote the process of generating $\{x, \tilde{x}\}$ by a random variable $\eta$, i.e., one sample
of $\eta$ corresponds to a pair $\{x, \tilde{x}\}$ where $\tilde{x}$ is constructed by randomly corrupting each of the $d_v$ dimensions of $x$
with some probability $\zeta$. Then, if the reconstruction loss is
$\mathcal{L}(\eta; W) := \mathcal{L}(x, \tilde{x}; W) = \|x - \sigma(W^T \sigma(W\tilde{x}))\|^2$, the objective in (1) becomes

$$\min_{W} \; f(W) := \mathbb{E}_{\eta}\, \mathcal{L}(\eta; W) \tag{2}$$

Observe that the objective in (2) is an expectation of the loss $\mathcal{L}(\eta; W)$ over the randomly corrupted sample
pairs $\eta := \{x, \tilde{x}\}$, and is non-convex. Analyzing convergence properties of such an objective using classical
techniques, especially in a (distributed) stochastic gradient setup, is difficult. Therefore, given that the loss function
is a composition of sigmoids, one possibility is to adopt convex variational relaxations of the sigmoids in (2) and then
apply standard convex analysis. But non-convexity is, in fact, the most interesting aspect of deep architectures, and so
the analysis of a loose convex relaxation will be unable to explain the empirical success of DAs, and deep learning in
general.

High Level Idea. The starting point of our analysis is a very recent result on stochastic gradients which only
makes a weaker assumption of Lipschitz differentiability of the objective (rather than convexity). We assume that the
optimization of (2) proceeds by querying a stochastic first order oracle (SFO), which provides noisy gradients of the
objective function. For instance, the SFO may simply compute a noisy gradient with a single sample $\eta^k := \{x_s, \tilde{x}_s\}$
at the $k$-th iteration and use that alone to evaluate $\nabla_W \mathcal{L}(\eta^k; W^k)$. The main idea adapted from Ghadimi & Lan (2013)
to our problem is to express the stopping criterion for the gradient updates by a probability distribution $P_R(\cdot)$ over
iterations $k$, i.e., the stopping iteration is $k \sim P_R(\cdot)$ (and hence the name randomized stochastic gradients, RSG).
Observe that this is the only difference from classical stochastic gradients used in pre-training, where the stopping
criterion is assumed to be the last iteration. RSG will offer more useful theoretical properties, and is a negligible
practical change to existing implementations. This then allows us to compute the expectation of the gradient norm,
where the expectation is over stopping iterations sampled according to $P_R(\cdot)$. For our case, the updates are given by

$$W^{k+1} \leftarrow W^k - \gamma^k\, G(\eta^k; W^k) \tag{3}$$

where $G(\eta^k; W^k) = \nabla_W \mathcal{L}(\eta^k; W^k)$ is the noisy gradient computed at the $k$-th iteration ($\gamma^k$ is the stepsize). We have
flexibility in specifying the distribution of the stopping criterion $P_R(\cdot)$. It can be fixed a priori or selected by a
hyper-training procedure that chooses the best $P_R(\cdot)$ (based on an accuracy measure) from a pool of distributions $\mathcal{P}_R$. With
these basic tools in hand, we first compute the expectation of gradients where the expectation accounts for both the
stopping criterion $k \sim P_R(\cdot)$ and $\eta := \{x, \tilde{x}\}$. We show that if the stepsizes $\gamma^k$ in (3) are chosen carefully, the
expected gradients decrease monotonically and converge. Based on this analysis, we derive the rate of convergence
and corresponding sample size estimates for DA pre-training. We describe the one-layer DA (i.e., with one hidden
layer) in detail, and all our results extend easily to the stacked and convolutional settings since the pre-training is done
layer-wise in multi-layer architectures.
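The RSG scheme described above can be sketched as ordinary stochastic gradient descent whose output iterate is chosen by a stopping index drawn in advance. The sketch below is illustrative only: `rsg`, `toy_oracle` and the toy non-convex objective are hypothetical stand-ins (not the DA objective), and the stopping probabilities are supplied by the caller rather than derived from any particular formula.

```python
import random

def rsg(grad_oracle, w0, stepsizes, stop_probs, rng):
    """Randomized stochastic gradient: run SGD and return the iterate W^R.

    grad_oracle(w, rng) returns a noisy gradient (list of floats) at w.
    stop_probs[k-1] is Pr(R = k) for k = 1..N; R is drawn before any update.
    """
    n_iters = len(stepsizes)
    # Sample the stopping iteration R in {1, ..., N}, independently of the data.
    r = rng.choices(range(1, n_iters + 1), weights=stop_probs, k=1)[0]
    w = list(w0)
    # W^1 is the initial point, so W^R is reached after R - 1 updates.
    for k in range(1, r):
        g = grad_oracle(w, rng)
        w = [wi - stepsizes[k - 1] * gi for wi, gi in zip(w, g)]
    return w, r

def toy_oracle(w, rng, noise=0.1):
    # Noisy gradient of the non-convex toy objective f(w) = sum_i w_i^2 / (1 + w_i^2).
    return [2.0 * wi / (1.0 + wi * wi) ** 2 + rng.gauss(0.0, noise) for wi in w]

rng = random.Random(0)
n_iters = 200
w_R, R = rsg(toy_oracle, [2.0, -1.5], [0.05] * n_iters, [1.0 / n_iters] * n_iters, rng)
print(R, w_R)
```

Putting all the stopping mass on the last iteration recovers classical stochastic gradients, which is the sense in which RSG is a small change to existing implementations.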


3 Denoising Autoencoders (DA) pre-training 


We first present some results on the continuity and boundedness of the objective $f(W)$ in (2), followed by the
convergence rates for the optimization. Denote the element in the $i$-th row and $j$-th column of $W$ by $W_{ij}$ where
$i = 1, \ldots, d_h$ and $j = 1, \ldots, d_v$. We require the following Lipschitz continuity assumptions on $\mathcal{L}(\eta; W_{ij})$ and the
gradient $\nabla_{W_{ij}} f(W_{ij})$, which are fairly common in numerical optimization. $L$ and $L'$ are Lipschitz constants.


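Assumption (A1).

$$\|\mathcal{L}(\eta; W_{ij}) - \mathcal{L}(\eta; \hat{W}_{ij})\| \le L\, \|W_{ij} - \hat{W}_{ij}\| \quad \forall\, i, j$$

Assumption (A2).

$$\|\nabla_{W_{ij}} f(W_{ij}) - \nabla_{W_{ij}} f(\hat{W}_{ij})\| \le L'\, \|W_{ij} - \hat{W}_{ij}\| \quad \forall\, i, j$$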


We see from (1) that $\mathcal{L}(\eta; W)$ is symmetric in $W_{ij}$. Depending on where $W_{ij}$ is located in the parameter
space (and the variance of each data dimension $j$), each $\mathcal{L}(\eta; W_{ij})$ corresponds to some $L_{ij}$, and $L$ will then be the
maximum of all such $L_{ij}$'s (similarly for $L'$).
Based on the definition of $G(\eta^k; W^k)$ and (2), we see that the noisy gradients $G(\eta^k; W^k)$ are unbiased estimates of
the true gradient since $\nabla_W f(W^k) = \mathbb{E}_{\eta^k} G(\eta^k; W^k)$. To compute the expectation of the gradients, $\nabla_W f(W^k)$, over
the distribution governing whether the process stops at iteration $k$, i.e., $R \sim P_R(\cdot)$, we first state a result regarding the
variance of the noisy gradients and the Lipschitz constant of $\nabla_W f(W^k)$. All proofs are included in the supplement.

Lemma 3.1 (Variance bound and Lipschitz constant). Since $\nabla_W f(W^k) = \mathbb{E}_{\eta^k} G(\eta^k; W^k)$, we have

$$\begin{aligned}
\mathrm{Var}\left(G(\eta^k; W^k)\right) &\le d_h d_v L^2 \\
\|\nabla_W f(W) - \nabla_W f(\hat{W})\| &\le \sqrt{d_h d_v}\, L'\, \|W - \hat{W}\|
\end{aligned} \tag{4}$$


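Proof. Recall that the assumptions [A1] and [A2] are,

$$\begin{aligned}
[A1] \quad & \|\mathcal{L}(\eta; W_{ij}) - \mathcal{L}(\eta; \hat{W}_{ij})\| \le L\, \|W_{ij} - \hat{W}_{ij}\| \quad \forall\, i, j \\
[A2] \quad & \|\nabla_{W_{ij}} f(W_{ij}) - \nabla_{W_{ij}} f(\hat{W}_{ij})\| \le L'\, \|W_{ij} - \hat{W}_{ij}\| \quad \forall\, i, j
\end{aligned}$$

The noisy gradient is defined as $G(\eta^k; W^k) = \nabla_W \mathcal{L}(\eta^k; W^k)$. Using the mean value theorem and [A1], we have
$|G(\eta^k; W^k_{ij})| \le L$. This implies that the maximum variance of $G(\eta^k; W^k_{ij})$ is $L^2$. We can then obtain the following
upper bound on the variance of $G(\eta^k; W^k)$,

$$\begin{aligned}
\mathbb{E}_{\eta^k}\left(\|G(\eta^k; W^k) - \nabla_W f(W^k)\|^2\right)
&= \mathbb{E}_{\eta^k}\Big(\sum_{ij} \big(G(\eta^k; W^k_{ij}) - \nabla_{W_{ij}} f(W^k_{ij})\big)^2\Big) \\
&= \sum_{ij} \mathrm{Var}\left(G(\eta^k; W^k_{ij})\right) \le d_h d_v L^2
\end{aligned} \tag{5}$$

Using [A2], we have

$$\begin{aligned}
\|\nabla_W f(W) - \nabla_W f(\hat{W})\|^2
&= \sum_{ij} \|\nabla_{W_{ij}} f(W_{ij}) - \nabla_{W_{ij}} f(\hat{W}_{ij})\|^2 \\
&\le \sum_{ij} (L'_{ij})^2\, \|W_{ij} - \hat{W}_{ij}\|^2 \le \sum_{ij} (L'_{ij})^2\, \|W - \hat{W}\|^2 \\
&\le d_h d_v L'^2\, \|W - \hat{W}\|^2
\end{aligned} \tag{6}$$

where the equality follows from the definition of the $\ell_2$-norm. The first inequality is from [A2]. The last two
inequalities use the definition of the $\ell_2$-norm and the fact that $L'$ is the maximum of all the $L'_{ij}$'s. □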


Whenever the inputs $x$ are bounded between 0 and 1, $f(W)$ is finite-valued everywhere and there exists a minimum
due to the bounded range of the sigmoid in (1). Also, $f(\cdot)$ is analytic with respect to $W_{ij}$ $\forall\, i, j$. Now, if one adopts the
RSG scheme for the optimization, using Lemma 3.1 we have the following upper bound on the expected gradients for
the one-layer DA pre-training in (2).


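Lemma 3.2 (Expected gradients of one-layer DA). Let $N \ge 1$ be the maximum number of RSG iterations with step
sizes $\gamma^k < \frac{2}{L'\sqrt{d_h d_v}}$. Let $P_R(\cdot)$ be given as

$$P_R(k) := \Pr(R = k) = \frac{2\gamma^k - L'\sqrt{d_h d_v}\,(\gamma^k)^2}{\sum_{k=1}^{N} \left(2\gamma^k - L'\sqrt{d_h d_v}\,(\gamma^k)^2\right)} \tag{7}$$

where $k = 1, \ldots, N$. If $D_f = 2(f(W^1) - f^*)$, we have

$$\mathbb{E}\left(\|\nabla_W f(W^R)\|^2\right) \le \frac{D_f + (\sqrt{d_h d_v})^3 L^2 L' \sum_{k=1}^{N} (\gamma^k)^2}{\sum_{k=1}^{N} \left(2\gamma^k - L'\sqrt{d_h d_v}\,(\gamma^k)^2\right)} \tag{8}$$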


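Proof. Broadly, this proof emulates the proof of Theorem 2.1 in Ghadimi & Lan (2013) with several adjustments. The
Lipschitz continuity assumptions (refer to [A1] and [A2]) give the following bounds on the variance of $G(\eta^k; W^k)$
and the Lipschitz continuity of $\nabla_W f(W)$ (refer to Lemma 3.1),

$$\begin{aligned}
\mathrm{Var}\left(G(\eta^k; W^k)\right) &\le d_h d_v L^2 \\
\|\nabla_W f(W) - \nabla_W f(\hat{W})\| &\le \sqrt{d_h d_v}\, L'\, \|W - \hat{W}\|
\end{aligned} \tag{9}$$

Using the properties of Lipschitz continuity we have,

$$f(W^{k+1}) \le f(W^k) + \langle \nabla_W f(W^k), W^{k+1} - W^k \rangle + \frac{\sqrt{d_h d_v}\, L'}{2}\, \|W^{k+1} - W^k\|^2$$

Since the update of $W^k$ using the noisy gradient is $W^{k+1} \leftarrow W^k - \gamma^k G(\eta^k; W^k)$, where $\gamma^k$ is the step-size, we
then have,

$$f(W^{k+1}) \le f(W^k) - \gamma^k \langle \nabla_W f(W^k), G(\eta^k; W^k) \rangle + \frac{\sqrt{d_h d_v}\, L'}{2}\, (\gamma^k)^2\, \|G(\eta^k; W^k)\|^2$$

By denoting $\delta^k := G(\eta^k; W^k) - \nabla_W f(W^k)$,

$$\begin{aligned}
f(W^{k+1}) \le f(W^k) &- \gamma^k \|\nabla_W f(W^k)\|^2 - \gamma^k \langle \nabla_W f(W^k), \delta^k \rangle \\
&+ \frac{\sqrt{d_h d_v}\, L'}{2}\, (\gamma^k)^2 \left[ \|\nabla_W f(W^k)\|^2 + 2\langle \nabla_W f(W^k), \delta^k \rangle + \|\delta^k\|^2 \right]
\end{aligned}$$

Rearranging terms on the right hand side above,

$$\begin{aligned}
f(W^{k+1}) \le f(W^k) &- \left(\gamma^k - \frac{\sqrt{d_h d_v}\, L'}{2}\, (\gamma^k)^2\right) \|\nabla_W f(W^k)\|^2 \\
&- \left(\gamma^k - \sqrt{d_h d_v}\, L'\, (\gamma^k)^2\right) \langle \nabla_W f(W^k), \delta^k \rangle + \frac{\sqrt{d_h d_v}\, L'}{2}\, (\gamma^k)^2\, \|\delta^k\|^2
\end{aligned}$$

Summing the above inequality for $k = 1, \ldots, N$,

$$\begin{aligned}
\sum_{k=1}^{N} \left(\gamma^k - \frac{\sqrt{d_h d_v}\, L'}{2}\, (\gamma^k)^2\right) \|\nabla_W f(W^k)\|^2 \le\; & f(W^1) - f(W^{N+1}) \\
&- \sum_{k=1}^{N} \left(\gamma^k - \sqrt{d_h d_v}\, L'\, (\gamma^k)^2\right) \langle \nabla_W f(W^k), \delta^k \rangle + \frac{\sqrt{d_h d_v}\, L'}{2} \sum_{k=1}^{N} (\gamma^k)^2\, \|\delta^k\|^2
\end{aligned} \tag{10}$$

where $W^1$ is the initial estimate. Using $f^* \le f(W^{N+1})$, we have,

$$\begin{aligned}
\sum_{k=1}^{N} \left(\gamma^k - \frac{\sqrt{d_h d_v}\, L'}{2}\, (\gamma^k)^2\right) \|\nabla_W f(W^k)\|^2 \le\; & f(W^1) - f^* \\
&- \sum_{k=1}^{N} \left(\gamma^k - \sqrt{d_h d_v}\, L'\, (\gamma^k)^2\right) \langle \nabla_W f(W^k), \delta^k \rangle + \frac{\sqrt{d_h d_v}\, L'}{2} \sum_{k=1}^{N} (\gamma^k)^2\, \|\delta^k\|^2
\end{aligned} \tag{11}$$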


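We now take the expectation of the above inequality over all the random variables in the RSG updating process,
which include the randomization $\eta$ used for constructing noisy gradients, and the stopping iteration $R \sim P_R(\cdot)$.

First, note that the stopping criterion is chosen at random with some given probability $P_R(\cdot)$ and is independent of $\eta$.
Second, recall that the random process $\eta$ is such that the random variable $\eta^k$ is independent of $W^k$ for some iteration
number $k$, because the SFO selects them randomly. However, the update point $W^{k+1}$ depends on $G(\eta^k; W^k)$ (which are
functions of the random variables $\eta^k$) from the first to the $k$-th iteration. That is, $W^{k+1}$ is not independent of $W^k$, and
in fact the updates $W^k$ form a Markov process. So, we can take the expectation with respect to the joint probability
$p(\eta^{[N]}, R) = p(\eta^{[N]})\, p(R)$, where $\eta^{[N]}$ denotes the random process from $\eta^1$ until $\eta^N$. We analyze each of the last two
terms on the right hand side of (11) by first taking expectation with respect to $\eta^{[N]}$. The second last term becomes,

$$\begin{aligned}
\mathbb{E}_{\eta^{[N]}} &\left[ \sum_{k=1}^{N} \left(\gamma^k - \sqrt{d_h d_v}\, L'\, (\gamma^k)^2\right) \langle \nabla_W f(W^k), \delta^k \rangle \right] \\
&= \sum_{k=1}^{N} \left(\gamma^k - \sqrt{d_h d_v}\, L'\, (\gamma^k)^2\right) \mathbb{E}_{\eta^{[k]}}\left( \langle \nabla_W f(W^k), \delta^k \rangle \mid \eta^1, \ldots, \eta^{k-1} \right) = 0
\end{aligned} \tag{12}$$

where the last equality follows from the definition of $\delta^k = G(\eta^k; W^k) - \nabla_W f(W^k)$ and
$\mathbb{E}_{\eta^k}(G(\eta^k; W^k)) = \nabla_W f(W^k)$. Further, from Equation (9) we have
$\mathbb{E}_{\eta^k} \|\delta^k\|^2 = \mathrm{Var}(G(\eta^k; W^k)) \le d_h d_v L^2$. So, the expectation of the last term in (11) becomes,

$$\mathbb{E}_{\eta^{[N]}} \left[ \frac{\sqrt{d_h d_v}\, L'}{2} \sum_{k=1}^{N} (\gamma^k)^2\, \|\delta^k\|^2 \right]
= \frac{\sqrt{d_h d_v}\, L'}{2} \sum_{k=1}^{N} (\gamma^k)^2\, \mathbb{E}_{\eta^{[N]}} \|\delta^k\|^2
\le \frac{(\sqrt{d_h d_v})^3 L^2 L'}{2} \sum_{k=1}^{N} (\gamma^k)^2 \tag{13}$$

Using (12) and (13) and the inequality in (11) we have,

$$\sum_{k=1}^{N} \left(2\gamma^k - \sqrt{d_h d_v}\, L'\, (\gamma^k)^2\right) \mathbb{E}_{\eta^{[N]}} \|\nabla_W f(W^k)\|^2 \le 2\left(f(W^1) - f^*\right) + (\sqrt{d_h d_v})^3 L' L^2 \sum_{k=1}^{N} (\gamma^k)^2 \tag{14}$$

Using the definition of $P_R(k)$ from Equation (7) and denoting $D_f = 2(f(W^1) - f^*)$, we finally obtain

$$\begin{aligned}
\mathbb{E}_{R, \eta^{[N]}}\left(\|\nabla_W f(W^R)\|^2\right)
&= \frac{\sum_{k=1}^{N} \left(2\gamma^k - L'\sqrt{d_h d_v}\,(\gamma^k)^2\right) \mathbb{E}_{\eta^{[N]}}\left(\|\nabla_W f(W^k)\|^2\right)}{\sum_{k=1}^{N} \left(2\gamma^k - L'\sqrt{d_h d_v}\,(\gamma^k)^2\right)} \\
&\le \frac{D_f + (\sqrt{d_h d_v})^3 L^2 L' \sum_{k=1}^{N} (\gamma^k)^2}{\sum_{k=1}^{N} \left(2\gamma^k - L'\sqrt{d_h d_v}\,(\gamma^k)^2\right)}
\end{aligned} \tag{15}$$

□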


The expectation in (8) is over $\eta$ and $R \sim P_R(\cdot)$. Here, $\gamma^k < \frac{2}{L'\sqrt{d_h d_v}}$ ensures that the summations in the
denominators of $P_R(\cdot)$ in (7) and the bound in (8) are positive. $D_f$ represents a quantity which is twice the deviation
of the objective $f(W)$ at the RSG starting point ($W^1$) from the optimum. Observe that the bound in (8) is a function
of $D_f$ and network parameters, and we will analyze it shortly.
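To make the roles of the step sizes concrete, the following sketch evaluates the stopping distribution and the bound numerically. It assumes the stopping weights in (7) are proportional to $2\gamma^k - L'\sqrt{d_h d_v}\,(\gamma^k)^2$ and that the bound in (8) is $\big(D_f + (d_h d_v)^{3/2} L^2 L' \sum_k (\gamma^k)^2\big) / \sum_k \big(2\gamma^k - L'\sqrt{d_h d_v}\,(\gamma^k)^2\big)$. The function names and all numeric values are illustrative assumptions; in practice $L$, $L'$ and $D_f$ are unknown and would have to be estimated.

```python
import math

def stopping_distribution(stepsizes, d_h, d_v, L_prime):
    """P_R(k) proportional to 2*gamma_k - L' sqrt(d_h d_v) gamma_k^2."""
    ell = L_prime * math.sqrt(d_h * d_v)      # Lipschitz constant of the gradient
    if any(g <= 0 or g >= 2.0 / ell for g in stepsizes):
        raise ValueError("step sizes must lie in (0, 2 / (L' sqrt(d_h d_v)))")
    weights = [2.0 * g - ell * g * g for g in stepsizes]
    total = sum(weights)
    return [w / total for w in weights]

def expected_gradient_bound(stepsizes, d_h, d_v, L, L_prime, D_f):
    """Upper bound on E||grad f(W^R)||^2 for the given step sizes."""
    ell = L_prime * math.sqrt(d_h * d_v)
    num = D_f + (d_h * d_v) ** 1.5 * L ** 2 * L_prime * sum(g * g for g in stepsizes)
    den = sum(2.0 * g - ell * g * g for g in stepsizes)
    return num / den

steps = [0.01 / k for k in range(1, 101)]     # illustrative diminishing step sizes
p_R = stopping_distribution(steps, d_h=20, d_v=100, L_prime=1.0)
print(sum(p_R), expected_gradient_bound(steps, 20, 100, L=1.0, L_prime=1.0, D_f=1.0))
```

With constant step sizes the weights are all equal, so the stopping iteration is uniform over $1, \ldots, N$.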

As stated, there are a few caveats that are useful to point out. Since no convexity assumptions are imposed on the
loss function, Lemma 3.2 on its own offers no guarantee that the function values decrease as $N$ increases. In particular,
in the worst case, the bound may be loose. For instance, when $D_f \approx 0$ (i.e., the initial point is already a good estimate
of the stationary point), the upper bound in (8) is non-zero. Further, the bound contains summations involving the
stepsizes, both in the numerator and denominator, indicating that the limiting behavior may be sensitive to the choices
of $\gamma^k$. The following result gives a remedy: by choosing $\gamma^k$ to be small enough, the upper bound in (8) will decrease
monotonically as $N$ increases.


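Lemma 3.3 (Monotonicity and convergence of expected gradients). By choosing $\gamma^k$ such that

$$\gamma^{k+1} \le \gamma^k \quad \text{with} \quad \gamma^1 < \frac{1}{L'\sqrt{d_h d_v}} \tag{16}$$

the upper bound of expected gradients in (8) decreases monotonically. Further, if the sequence for $\gamma^k$ satisfies

$$\lim_{N \to \infty} \sum_{k=1}^{N} \gamma^k \to \infty, \qquad \lim_{N \to \infty} \sum_{k=1}^{N} (\gamma^k)^2 < \infty \tag{17}$$

then $\lim_{N \to \infty} \mathbb{E}\left(\|\nabla_W f(W^R)\|^2\right) \to 0$.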


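Proof. We first show the monotonicity of the expected gradients followed by its limiting behavior. Observe that
whenever $\gamma^k < \frac{1}{L'\sqrt{d_h d_v}}$, we have

$$\left(2 - L'\sqrt{d_h d_v}\, \gamma^k\right) > 1 \quad \forall\, k$$

Then the upper bound in (8) reduces to

$$\mathbb{E}\left(\|\nabla_W f(W^R)\|^2\right) \le \frac{D_f + (\sqrt{d_h d_v})^3 L^2 L' \sum_{k=1}^{N} (\gamma^k)^2}{\sum_{k=1}^{N} \gamma^k} \tag{18}$$

To show that the right hand side in the above inequality decreases as $N$ increases, we need to show the following

$$\frac{D_f + (\sqrt{d_h d_v})^3 L^2 L' \sum_{k=1}^{N+1} (\gamma^k)^2}{\sum_{k=1}^{N+1} \gamma^k} \le \frac{D_f + (\sqrt{d_h d_v})^3 L^2 L' \sum_{k=1}^{N} (\gamma^k)^2}{\sum_{k=1}^{N} \gamma^k} \tag{19}$$

By denoting the terms in the above inequality as follows,

$$a = D_f + (\sqrt{d_h d_v})^3 L^2 L' \sum_{k=1}^{N} (\gamma^k)^2, \quad b = (\sqrt{d_h d_v})^3 L^2 L' (\gamma^{N+1})^2, \quad c = \sum_{k=1}^{N} \gamma^k, \quad d = \gamma^{N+1} \tag{20}$$

the inequality in Equation (19) holds whenever

$$\frac{a + b}{c + d} \le \frac{a}{c} \iff b \le \frac{a}{c}\, d \tag{21}$$

Rearranging the terms in the last inequality above, we have

$$(\sqrt{d_h d_v})^3 L^2 L'\, \gamma^{N+1} \sum_{k=1}^{N} \gamma^k \le D_f + (\sqrt{d_h d_v})^3 L^2 L' \sum_{k=1}^{N} (\gamma^k)^2 \tag{22}$$

Recall that $D_f = 2(f(W^1) - f^*)$; so without loss of generality we always have $D_f \ge 0$. With this result, the last
inequality in (22) is always satisfied whenever $\gamma^{N+1} \le \gamma^k$ for $k = 1, \ldots, N$. Since this needs to be true for all $N$, we
require $\gamma^{k+1} \le \gamma^k$ for $k = 1, \ldots, N - 1$. This proves the monotonicity of expected gradients. For the limiting case,
recall the relaxed upper bound from (18). Whenever $\lim_{N \to \infty} \sum_{k=1}^{N} \gamma^k \to \infty$ and $\lim_{N \to \infty} \sum_{k=1}^{N} (\gamma^k)^2 < \infty$, the
right hand side in (18) converges to 0. □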

The second part of the lemma is easy to ensure by choosing diminishing step-sizes (as a function of $k$). This result
ensures the convergence of expected gradients, provides an easy way to construct $P_R(\cdot)$ based on (7) and (17), and to
decide the stopping iteration based on $P_R(\cdot)$ ahead of time.

Remarks. Note that the maximum $\gamma^k$ in (16) needed to ensure the monotonic decrease of expected gradients
depends on $L'$. Whenever the estimate of $L'$ is too loose, the corresponding $\gamma^k$ might be too small to be practically
useful. An alternative in such cases is to compute the RSG updates for some $N$ (fixed a priori) iterations using
a reasonably small stepsize, and select $R$ to be the iteration with the smallest possible gradient $\|\nabla_W f(W^k)\|^2$ or
the cumulative gradient $\sum_{i=1}^{k} \|\nabla_W f(W^i)\|^2$ among some last $N_1 < N$ iterations. While a diminishing stepsize
following (17) is ideal, the next result gives the best possible constant stepsize $\gamma^k = \gamma$, $\forall k$, and the corresponding rate
of convergence.
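The monotone decrease promised by Lemma 3.3 can be checked numerically. The sketch below assumes the relaxed bound has the form $\big(D_f + (d_h d_v)^{3/2} L^2 L' \sum_k (\gamma^k)^2\big) / \sum_k \gamma^k$ and uses the diminishing schedule $\gamma^k = \gamma^1 / k$ with $\gamma^1 < 1/(L'\sqrt{d_h d_v})$. The schedule, the function names and the constants are illustrative assumptions, not choices made in the paper.

```python
import math

def relaxed_bound(stepsizes, d_h, d_v, L, L_prime, D_f):
    """(D_f + (d_h d_v)^{3/2} L^2 L' sum gamma_k^2) / sum gamma_k."""
    c = (d_h * d_v) ** 1.5 * L ** 2 * L_prime
    return (D_f + c * sum(g * g for g in stepsizes)) / sum(stepsizes)

def diminishing_stepsizes(n, d_h, d_v, L_prime, scale=0.5):
    """gamma_k = gamma_1 / k with gamma_1 = scale / (L' sqrt(d_h d_v)), 0 < scale < 1."""
    gamma_1 = scale / (L_prime * math.sqrt(d_h * d_v))
    return [gamma_1 / k for k in range(1, n + 1)]

# The bound shrinks as the iteration budget N grows (sum 1/k diverges, sum 1/k^2 converges).
for n in (10, 100, 1000):
    steps = diminishing_stepsizes(n, d_h=20, d_v=100, L_prime=1.0)
    print(n, relaxed_bound(steps, 20, 100, L=1.0, L_prime=1.0, D_f=1.0))
```

Because $\sum_k 1/k$ grows only logarithmically, the decrease under this particular schedule is slow, which is one motivation for also studying a constant step size.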

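Corollary 3.4 (Convergence of one-layer DA). The optimal constant step sizes $\gamma^k$ are given by

$$\gamma^k = \gamma = \frac{D}{\sqrt{N}\,(d_h d_v)^{3/4}} \;\; \forall\, k; \qquad 0 < D < \frac{\sqrt{N}}{L'}\,(d_h d_v)^{1/4} \tag{23}$$

If we denote $\bar{D} = \frac{D_f}{D} + D L^2 L'$, then we have

$$\mathbb{E}\left(\|\nabla_W f(W^R)\|^2\right) \le \bar{D}\, \frac{(d_h d_v)^{3/4}}{\sqrt{N}} \tag{24}$$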




















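Proof. Using constant stepsizes $\gamma^k = \gamma$, $k = 1, \ldots, N$, the convergence bound in (8) reduces to

$$\mathbb{E}\left(\|\nabla_W f(W^R)\|^2\right) \le \frac{D_f + (\sqrt{d_h d_v})^3 L^2 L' N \gamma^2}{N \gamma \left(2 - L'\sqrt{d_h d_v}\, \gamma\right)} \tag{25}$$

To achieve monotonic decrease of expected gradients, we require $\gamma < \frac{1}{L'\sqrt{d_h d_v}}$ (from (16) in Lemma 3.3). For such
$\gamma$, $\left(2 - L'\sqrt{d_h d_v}\, \gamma\right) > 1$ $\forall\, k$, which when used in (25) gives,

$$\mathbb{E}\left(\|\nabla_W f(W^R)\|^2\right) \le \frac{D_f}{N\gamma} + (\sqrt{d_h d_v})^3 L^2 L' \gamma \tag{26}$$

Observe that as $\gamma$ increases (resp. decreases), the first term on the right hand side of the above inequality decreases
(resp. increases) and the second term increases (resp. decreases). Therefore, the optimal $\gamma^k = \gamma$ for all $k$ is obtained
by balancing these two terms, as in

$$\frac{D_f}{N\gamma} = (\sqrt{d_h d_v})^3 L^2 L' \gamma \implies \gamma = \sqrt{\frac{D_f}{N L^2 L'}}\, \frac{1}{(d_h d_v)^{3/4}} \tag{27}$$

However, the above choice of $\gamma$ has the unknowns $D_f$, $L'$ and $L$ (although note that the latter two constants can be
empirically estimated by sampling the loss functions $\mathcal{L}(\cdot)$ for different choices of $\eta$ and $W$). Replacing $\sqrt{\frac{D_f}{L^2 L'}}$ by
some $D$, the best possible choice of constant stepsize is

$$\gamma^k = \gamma = \frac{D}{\sqrt{N}\,(d_h d_v)^{3/4}} \quad \forall\, k$$

Since $\gamma^k$ needs to be smaller than $\frac{1}{L'\sqrt{d_h d_v}}$ as discussed at the start of the proof, we have

$$\frac{D}{\sqrt{N}\,(d_h d_v)^{3/4}} < \frac{1}{L'\sqrt{d_h d_v}} \implies D < \frac{\sqrt{N}}{L'}\,(d_h d_v)^{1/4} \tag{28}$$

Now substituting this optimal constant stepsize from (28) into the upper bound in

$$\mathbb{E}\left(\|\nabla_W f(W^R)\|^2\right) \le \frac{D_f}{N\gamma} + (d_h d_v)^{3/2} L^2 L' \gamma \tag{29}$$

we get

$$\mathbb{E}\left(\|\nabla_W f(W^R)\|^2\right) \le \frac{D_f\, (d_h d_v)^{3/4}}{\sqrt{N}\, D} + \frac{D L^2 L' (d_h d_v)^{3/4}}{\sqrt{N}} \tag{30}$$

and by denoting $\bar{D} = \frac{D_f}{D} + D L^2 L'$, we finally have

$$\mathbb{E}\left(\|\nabla_W f(W^R)\|^2\right) \le \bar{D}\, \frac{(d_h d_v)^{3/4}}{\sqrt{N}} \tag{31}$$

□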


The upper bound in (8) can be written as a summation of two terms, one of which involves $D_f$. The optimal
stepsize in (23) is calculated by balancing these terms as $N$ increases (refer to the supplement). The ideal choice for
$D$ is $\sqrt{\frac{D_f}{L^2 L'}}$, in which case $\bar{D}$ reduces to $2L\sqrt{D_f L'}$. For a fixed network size ($d_h$ and $d_v$), Corollary 3.4 shows
that the rate of convergence for one-layer DA pre-training using RSG is $O(1/\sqrt{N})$. It is interesting to see that the
convergence rate is proportional to $(d_h d_v)^{3/4}$ where the number of parameters of our bipartite network (of which DA
is one example) is $d_h d_v$.

Corollary 3.4 gives the convergence properties of a single RSG run over some $R$ iterations. However, in practice
one is interested in a large deviation bound, where the best possible solution is selected from multiple independent runs
of RSG. Such a large deviation estimate is indeed more meaningful than one RSG run because of the randomization
over $\eta$ in (2). Consider a $C$-fold RSG with $C \ge 1$ independent RSG estimates of $W$ denoted by $W^{R_1}, \ldots, W^{R_C}$.
Using the expected convergence from (24), we can compute a $(\epsilon, \delta)$-solution defined as,

Definition ($(\epsilon, \delta)$-solution). For some given $\epsilon > 0$ and $\delta \in (0, 1)$, an $(\epsilon, \delta)$-solution of one-layer DA is given by
$\{W^{R_c}\}$, $c = 1, \ldots, C$ such that

$$\Pr\left( \min_{c} \|\nabla_W f(W^{R_c})\|^2 \ge \epsilon \bar{D} \right) \le \delta \tag{32}$$

$\epsilon$ governs the goodness of the estimate $W$, and $\delta$ bounds the probability of good estimates over multiple
independent RSG runs. Since $N$ is the maximum iteration count (i.e., maximum number of SFO calls), the number of
data instances required is $S = N/t$, where $t$ denotes the average number of times each instance is used by the oracle.
Although in practice there is no control over $t$ (in which case, we simply have $S \le N$), we estimate the required
sample size and the minimum number of folds ($C$) in terms of $t$, as shown by the following result.


Corollary 3.5 (Sample size estimates of one-layer DA). The number of independent RSG runs ($C$) and the number
of data instances ($S$) required to compute a $(\epsilon, \delta)$-solution are given by

$$C(r, \delta) \ge \left\lceil \frac{\log(1/\delta)}{\log(\sqrt{r})} \right\rceil; \qquad S(r, \epsilon) \ge \frac{r\, (d_h d_v)^{3/2}}{t\, \epsilon^2} \tag{33}$$

where $r > 1$ is a given constant, $\lceil \cdot \rceil$ denotes the ceiling operation and $t$ denotes the average number of times each data
instance is used.


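Proof. Recall that a $(\epsilon, \delta)$-solution is defined such that

$$\Pr\left( \min_{c} \|\nabla_W f(W^{R_c})\|^2 \ge \epsilon \bar{D} \right) \le \delta \tag{34}$$

for some given $\epsilon > 0$ and $\delta \in (0, 1)$. Using basic probability properties,

$$\begin{aligned}
\Pr\left( \min_{c} \|\nabla_W f(W^{R_c})\|^2 \ge \epsilon \bar{D} \right)
&= \Pr\left( \|\nabla_W f(W^{R_c})\|^2 \ge \epsilon \bar{D} \;\; \forall\, c = 1, \ldots, C \right) \\
&= \prod_{c=1}^{C} \Pr\left( \|\nabla_W f(W^{R_c})\|^2 \ge \epsilon \bar{D} \right)
\end{aligned} \tag{35}$$

Using Markov inequality and (24),

$$\Pr\left( \|\nabla_W f(W^{R_c})\|^2 \ge \epsilon \bar{D} \right) \le \frac{\mathbb{E}\left(\|\nabla_W f(W^{R_c})\|^2\right)}{\epsilon \bar{D}} \le \frac{(d_h d_v)^{3/4}}{\epsilon \sqrt{N}} \tag{36}$$

Hence, the number of SFO calls per RSG is at least $N > \frac{(d_h d_v)^{3/2}}{\epsilon^2}$ for the above probability to make sense. If $r > 1$
is a constant, then the number of calls per RSG is $N = \frac{r\,(d_h d_v)^{3/2}}{\epsilon^2}$. Using this identity, and (35) and (36), we get

$$\Pr\left( \min_{c} \|\nabla_W f(W^{R_c})\|^2 \ge \epsilon \bar{D} \right) \le \prod_{c=1}^{C} \frac{1}{\sqrt{r}} = \frac{1}{r^{C/2}} \tag{37}$$

To ensure that this probability is smaller than a given $\delta$ and noting that $C$ is a positive integer, we have

$$\frac{1}{r^{C/2}} \le \delta \implies C(r, \delta) \ge \left\lceil \frac{\log(1/\delta)}{\log(\sqrt{r})} \right\rceil \tag{38}$$

where $\lceil \cdot \rceil$ denotes the ceiling operation. Note that there is no randomization over the data instances among multiple
instances of RSG ($c = 1, \ldots, C$). That is, each RSG is going to use all the available data instances. Hence, we can
just look at one RSG to derive the sample size required. Let $S$ be the number of data instances, $t_s$ be the number of
times the $s$-th instance is used in one RSG and $t = \mathbb{E}(t_s)$ be the average number of times each instance/example is used.
We then have

$$N = \sum_{s=1}^{S} t_s \implies N \approx \mathbb{E}(N) = S\, \mathbb{E}(t_s) \implies S = \frac{N}{t} \tag{39}$$

Substituting $N = \frac{r\,(d_h d_v)^{3/2}}{\epsilon^2}$ gives the bound on $S$ in (33). □

The above result shows that the required sample size is $O(1/\epsilon^2)$, which is easy to see from the convergence rate
of $O(1/\sqrt{N})$ in (24). The constant $r$ in Corollary 3.5 acts like a trade-off parameter between the number of folds
$C(r, \delta)$ and the sample size $S(r, \epsilon)$. Hence, not surprisingly, more folds are needed to guarantee a $(\epsilon, \delta)$-solution for a
smaller $S$. Note that the minimum possible $S$ is $\frac{(d_h d_v)^{3/2}}{t\, \epsilon^2}$; below this quantity the idea of computing an $(\epsilon, \delta)$-solution
in a large deviation sense is not meaningful (refer to proof of Corollary 3.5 in the supplement).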
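The trade-off governed by $r$ can be explored with a short calculation. The sketch assumes Corollary 3.5 reads $C \ge \lceil \log(1/\delta) / \log(\sqrt{r}) \rceil$ and $S \ge r\,(d_h d_v)^{3/2} / (t\,\epsilon^2)$. The helper names and the values of $r$ and $t$ used below are illustrative assumptions.

```python
import math

def num_folds(r, delta):
    """Smallest integer C with 1 / r^(C/2) <= delta, for r > 1."""
    return math.ceil(math.log(1.0 / delta) / math.log(math.sqrt(r)))

def min_samples(r, eps, d_h, d_v, t):
    """Lower bound r (d_h d_v)^{3/2} / (t eps^2) on the number of data instances."""
    return r * (d_h * d_v) ** 1.5 / (t * eps ** 2)

# Larger r means fewer independent runs but more data per run.
for r in (1.5, 4.0, 16.0):
    print(r, num_folds(r, delta=0.05), min_samples(r, eps=0.05, d_h=20, d_v=100, t=100))
```

As $r$ approaches 1 the sample bound approaches its minimum, but the number of folds grows without limit, which is why that minimum is not attainable in a large deviation sense.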

Remarks. To get a practical sense of the bound, consider a DA with $d_v = 100$, $d_h = 20$. According to Corollary 3.5,
the number of data instances for computing a (0.05, 0.05)-solution with $t = 10^2$ is at least 0.3 million. Depending
on the structural characteristics of the data (variance of each dimension, correlations across multiple dimensions etc.),
which we do not exploit, the bound from Corollary 3.5 will overestimate the required number of samples, as expected.
Overall, the convergence and sample size bounds in (24) and (33) provide some justification of a behavior which is
routinely observed in practice: a large number of unsupervised data instances is required for efficient pre-training
of deep architectures (Chapter 4, Bengio (2009); Erhan et al. (2010)). Note that the convergence bound
in (24) does not differentiate between the visible and hidden layers, implying that the bound is symmetric with respect
to $d_h$ and $d_v$. However, there is empirical evidence that the choice of $d_h$ would affect the reconstruction error, with
oversized networks giving better generalization in general Lawrence et al. (1998); Paugam-Moisy (1997). This can be
seen by recalling that until $d_h$ is more than the dimensionality of the low-dimensional manifold on which the input
data lies, the DA setup may not be able to compute good estimates of $W$. We discuss this issue in more detail when
presenting our experiments.

Recall that pre-training is done layer-wise in deep architectures with multiple hidden layers. Hence, the bounds
presented above in (24) and (33) directly apply to stacked DAs with no changes. For stacked DAs the total number
of SFO calls would simply be the sum of the calls across all the layers. The results also provide insights regarding
convolutional neural networks where one-to-two layer neural nets are learned from small regions (e.g., local
neighborhoods in imaging data), whose outputs are then combined using some nonlinear pooling operation Lee et al. (2009).
Observe that the sub-linear dependence of the convergence rate on the network size ($d_h d_v$) from (24) implies that
whenever $S$ (and hence $N$) is reasonably large, small networks are learned efficiently. This partially supports the evidence
that deep convolutional networks with multiple levels of pooling over a large number of small networks are successful
in learning complex concepts Lee et al. (2009); Krizhevsky et al. (2012). With these results in hand, we now consider
the case of distributed synchronous pre-training where small parts of the whole network are learned at a time.


4 Distributed DA pre-training 

The results in the previous section show that the convergence rate has a polynomial dependence on the size of the
network ($d_h d_v$), where the number of SFO calls increases as $(d_h d_v)^{3/2}$. Although this worst-case growth is
unlikely to be observed in practice because of the redundancies across the input data dimensions (for example,
sufficiently strong correlations across multiple input dimensions, presence of invariant dimensions, etc.), the results
in Corollaries 3.4 and 3.5 show that pre-training very large DAs is impractical with smaller sample sizes (and thereby
fewer iterations). There is empirical evidence that this is indeed the case in practice Mian^e^a^ 2009 i; Raina et al. (2009).
Several authors have suggested learning parts of the network instead. Recently, Dean et al. (2012) showed empirical
results on how distributed learning substantially improves convergence while not sacrificing test-time performance.
Motivated by these ideas, we extend the results presented in Section 3 to the distributed pre-training setting. We first
show that the objective in (40) lends itself to being distributed in a simple way, where the whole network is broken down
into multiple parts and each such sub-network is learned in a synchronous manner. By relating the corruption
probabilities of these sub-networks to that of the parent DA, we compute a lower bound on the number of sub-networks
required. Later, we present the convergence and sample size results for this distributed DA pre-training setting.


Recall that the objective of DA in (40) involves an expectation over corruptions $\tilde{x}$ where certain visible units are
nullified (set to 0). This implies that a corrupted dimension does not provide any information to the hidden layer.
Since the DA network is bipartite, the objective can then be separated into sub-networks (referred to as sub-DAs):
while the hidden layer remains unchanged, we use only a subset of all available $d_v$ visible units. For each such
sub-DA of size $(\lceil \tau d_v \rceil, d_h)$, where $0 < \tau < 1$ is the fraction of the visible layer used, the inputs from all the
left-out $(d_v - \lceil \tau d_v \rceil)$ visible units are zero, i.e., their corruption probability is 1. Now, consider the setting where
$B$ such sub-DAs are constructed by sampling $\lceil \tau d_v \rceil$ visible units with replacement. The following result shows the
equivalence of learning these $B$ sub-DAs to learning one large DA of size $(d_v, d_h)$.

Lemma 4.1 (Distributed learning of one-layer DA). Consider a DA network of size $(d_v, d_h)$ with corruption
probability $\zeta$, and some $1 - \zeta < \tau < 1$ and $0 < \phi \ll 1$. Learning this DA is equivalent to learning
$B \ge \frac{\log \phi}{\log(1 - \tau)}$ number of DAs of size $(\lceil \tau d_v \rceil, d_h)$ with corruption probability $1 - \frac{1 - \zeta}{\tau}$, whose visible
units are a fraction $\tau$ (with replacement) of the total available $d_v$ units, where $\lceil \cdot \rceil$ denotes the ceiling operation.

Proof. Recall that the DA objective is

$$\min_{W} \; \mathbb{E}_{p(x, \tilde{x})} \| x - \sigma(W^T \sigma(W \tilde{x})) \|^2 \tag{40}$$

By considering one term from this expectation, we show that it is equivalent to learning two disjoint DAs of sizes
$(\lceil \tau d_v \rceil, d_h)$ and $(d_v - \lceil \tau d_v \rceil, d_h)$ synchronously. Without loss of generality, let this term correspond to the last
$d_v - \lceil \tau d_v \rceil$ visible units being corrupted with probability 1, i.e., set to 0. For the rest of the proof, any visible unit
that is set to 0 via corruption will be referred to as a ‘clamped’ unit.

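Let $W_1$ (of size $d_h \times \lceil \tau d_v \rceil$) and $W_2$ (of size $d_h \times (d_v - \lceil \tau d_v \rceil)$) be the matrices of edge weights (i.e., unknown
parameters) from the un-clamped and clamped visible units to all $d_h$ hidden units respectively. For some input $x$, let
$\tilde{x}_1$ (of length $\lceil \tau d_v \rceil \times 1$) and $\tilde{x}_2$ (of length $(d_v - \lceil \tau d_v \rceil) \times 1$) be the un-clamped and clamped parts of its
corruption. Hence $\tilde{x}_1 = x_1$ and $\tilde{x}_2 = 0$. Then the hidden activation $h$, and the corresponding un-clamped and clamped
reconstructions, $\hat{x}_1$ and $\hat{x}_2$, have the following structure,

$$h = \sigma(W_1 \tilde{x}_1 + W_2 0) = \sigma(W_1 x_1), \qquad \hat{x}_1 = \sigma(W_1^T \sigma(W_1 x_1)),$$
$$\hat{x}_2 = \sigma(W_2^T \sigma(W_1 x_1)) = \sigma(W_2^T \sigma(W_1 x_1 + W_2 \tilde{x}_2)) \tag{41}$$

The objective for the term considered then simplifies to

$$\| x - \hat{x} \|^2 = \| x_1 - \sigma(W_1^T \sigma(W_1 x_1)) \|^2 + \| x_2 - \sigma(W_2^T \sigma(W_1 x_1 + W_2 \tilde{x}_2)) \|^2 \tag{42}$$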


It is easy to see that minimizing the first term in the above summation is exactly minimizing the recovery error of
$x_1$ with no corruption applied to it. That is to say, it corresponds to one of the terms in the objective of a smaller DA
of size $(\lceil \tau d_v \rceil, d_h)$. The second term in the summation has a similar structure, but with an extra $W_1 x_1$ within the
inner sigmoid. If $W_1$ is fixed, then the second term is minimizing the recovery of $x_2$ with ‘complete’ corruption applied
to all the $d_v - \lceil \tau d_v \rceil$ dimensions. Hence we can first pre-train the $(\lceil \tau d_v \rceil, d_h)$ sized sub-DA, use the learned $W_1$ as
a constant bias, and then learn the $(d_v - \lceil \tau d_v \rceil, d_h)$ sized sub-DA. This strategy can be shown for all the terms in the
objective in (40). With this, we can begin with a set of sub-DAs of size $(\lceil \tau d_v \rceil, d_h)$ each and pre-train them one at a
time in a synchronous manner, thereby justifying the distributed setting for DA pre-training.
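
The split of the squared reconstruction error into an un-clamped and a clamped term can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes a tied-weight sigmoid DA without biases (as the objective in (40) is written here), and the helper names (`split_objective`, `matvec`, `transpose`) and the tiny random problem are purely illustrative.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(m, v):
    # Matrix-vector product for a list-of-rows matrix.
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def split_objective(w, x, n_keep):
    """Return (full, term1, term2) for the corruption pattern that clamps
    every visible unit after the first n_keep ones (assumed layout)."""
    x_tilde = x[:n_keep] + [0.0] * (len(x) - n_keep)
    h = [sigmoid(z) for z in matvec(w, x_tilde)]            # h = sigma(W x_tilde)
    x_hat = [sigmoid(z) for z in matvec(transpose(w), h)]   # x_hat = sigma(W^T h)
    full = sum((a - b) ** 2 for a, b in zip(x, x_hat))

    w1 = [row[:n_keep] for row in w]   # d_h x n_keep block (un-clamped units)
    w2 = [row[n_keep:] for row in w]   # d_h x (d_v - n_keep) block (clamped units)
    h1 = [sigmoid(z) for z in matvec(w1, x[:n_keep])]       # sigma(W_1 x_1)
    x1_hat = [sigmoid(z) for z in matvec(transpose(w1), h1)]
    x2_hat = [sigmoid(z) for z in matvec(transpose(w2), h1)]
    term1 = sum((a - b) ** 2 for a, b in zip(x[:n_keep], x1_hat))
    term2 = sum((a - b) ** 2 for a, b in zip(x[n_keep:], x2_hat))
    return full, term1, term2


rng = random.Random(0)
d_v, d_h, n_keep = 6, 3, 4
w = [[rng.uniform(-1, 1) for _ in range(d_v)] for _ in range(d_h)]
x = [rng.random() for _ in range(d_v)]
full, term1, term2 = split_objective(w, x, n_keep)
print(abs(full - (term1 + term2)) < 1e-12)
```

Because the clamped inputs are zero, the hidden activation depends only on the un-clamped block, which is why the two terms add up to the full error.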

Now consider such a setup where many such $(\lceil \tau d_v \rceil, d_h)$ sub-DAs are learned synchronously by randomly
sampling different subsets of $\lceil \tau d_v \rceil$ of the total available visible units. It is easy to see that, in expectation, this
sequential distributed learning is equivalent to minimizing all the terms inside the expectation in (40). Hence learning
the big $(d_v, d_h)$ DA is the same as sequentially learning small DAs of size $(\lceil \tau d_v \rceil, d_h)$ where the $\lceil \tau d_v \rceil$ units are
chosen at random. In practice, this is achieved only if each visible unit is included in at least one of the sub-DAs (i.e.,
all unknown parameters are updated at least once). Let $B$ be the number of sub-DAs that are learned sequentially, and
let $0 < \phi \ll 1$ bound the probability that a given unit is in none of the $B$ sub-DAs (ideally, $\phi$ should be small in
practice). It is easy to see that this probability is $(1 - \tau)^B$, because the probability that a particular unit is sampled
to be included in one sub-DA is $\tau$. Since $1 - \tau < 1$, we then have

$$(1 - \tau)^B \le \phi \implies B \ge \frac{\log \phi}{\log(1 - \tau)} \tag{43}$$


We now relate the corruption probability of the sub-DAs (denoted by $q$) to that of the mother DA ($\zeta$). Recall that
clamping is the same as corrupting (i.e., setting the input from that unit to 0). Given the sampling fraction $\tau$, the
probability that a given visible unit (among $1, \ldots, d_v$) belongs to one sub-DA is $\tau$. Further, if $q$ is the corruption
probability of this sub-DA, then the clamping probability of a given unit is $(1 - \tau) + \tau q$. If the $B$ sub-DAs are constructed
independently by sampling the visible units with replacement, then the overall corruption (clamping) probability is
$1 - \tau + \tau q$. We require this to be equal to $\zeta$, which then gives $q = 1 - \frac{1 - \zeta}{\tau}$ (with $1 - \zeta < \tau < 1$). □
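
The two quantities that Lemma 4.1 ties together, the minimum number of sub-DAs and the sub-DA corruption probability, are simple to compute. The sketch below is only an illustration under the relations derived in the proof, $(1 - \tau)^B \le \phi$ and $\zeta = 1 - \tau + \tau q$; the function names are hypothetical and not from the paper.

```python
import math


def min_num_sub_das(tau, phi):
    """Smallest integer B with (1 - tau)**B <= phi."""
    if not (0.0 < tau < 1.0 and 0.0 < phi < 1.0):
        raise ValueError("need 0 < tau < 1 and 0 < phi < 1")
    b = max(1, math.ceil(math.log(phi) / math.log(1.0 - tau)))
    # Guard against floating-point rounding right at the boundary.
    while (1.0 - tau) ** b > phi:
        b += 1
    return b


def sub_da_corruption(zeta, tau):
    """Corruption probability q of each sub-DA, solved from zeta = 1 - tau + tau * q."""
    if not (1.0 - zeta < tau < 1.0):
        raise ValueError("need 1 - zeta < tau < 1")
    return 1.0 - (1.0 - zeta) / tau


print(min_num_sub_das(tau=0.5, phi=0.01))
print(sub_da_corruption(zeta=0.5, tau=0.8))
```

With $\tau = 0.5$ and $\phi = 0.01$, seven sub-DAs suffice since $0.5^7 \approx 0.0078$; with $\zeta = 0.5$ and $\tau = 0.8$ each sub-DA uses $q = 0.375$, and indeed $0.2 + 0.8 \times 0.375 = 0.5$.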


The above statement (proof in supplement) establishes the equivalence of distributed DA (dDA) pre-training to
the non-distributed case by explicitly considering the DA’s property of using nullified/corrupted inputs (which then
provide no new information to the objective). We remark that Lemma 4.1 is specific to the case of DAs and (unlike
many other results in this paper) may not be directly applicable to other types of auto-encoders that do not involve an
explicit corruption function. Also, $\tau$ and $\zeta$ should be chosen carefully so that $1 - \zeta < \tau$ and $1 - \zeta$ does not end up
too close to 1. Specifically, whenever $\zeta$ is very small, according to Lemma 4.1 there is very little room for distribution
because $\tau$ will be close to 1. This is not surprising because, with small $\zeta$, the DA is allowed to discard visible units very
rarely, pushing $\tau$ closer to 1, where the distributed setup tends to behave like the non-distributed case. Although these
requirements seem too restrictive, we show in Section 5 that they can be fairly relaxed in practice. Overall, Lemma
4.1 provides some justification (from the perspective of the autoencoder design itself) for distributing the learning
process. The lower bound on $B$ in Lemma 4.1 ensures that all the unknown parameters are updated in at least one of
the $B$ sub-DAs. Hence, in practice, $\phi$ can be chosen to be very small and the sub-DAs can be explicitly sampled to be
“non-overlapping” (i.e., disjoint with respect to the parameters). Once the hyper-parameters $\tau$, $\zeta$ and $B$ are fixed, the
recipe is simple. The dDA pre-training setup involves running $B$ individual RSGs on randomly sampled disjoint
sub-DAs. The $B$ sub-DAs share a common parameter set which holds the latest estimates of $W$.

Similar to the multi-fold RSG setup in Section 3, we perform $M$ meta-iterations of the dDA pre-training, where
each meta-iteration involves learning $B$ sub-DAs. Because sub-DAs are constructed randomly, different
meta-iterations end up with different sets of sub-DAs, ensuring low variance in the estimate of $W$ corresponding to the
$(\epsilon, \delta)$-solution (34). It is clear that due to the reduction in the size of the network by a factor of $\tau$, the convergence rate
and required sample sizes (see (24) and (33)) will improve in this distributed case. This observation is formalized in
the two results below. Here, $\gamma_b^k$ denotes the step size in the $k$-th iteration of the $b$-th RSG (corresponding to the $b$-th
sub-DA) and $N$ is the number of SFO calls for each of the $B$ RSGs. The subscript $b$ in $W_b^k$ denotes the updates of the
$b$-th RSG, where $R_b$ is its stopping iteration.
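
The scheduling part of this recipe (random disjoint sub-DAs, redrawn in every meta-iteration) can be sketched as follows. This is an illustrative sketch only: it assumes the "non-overlapping" variant, in which the $d_v$ visible units are partitioned into $B$ groups of nearly equal size (so $\tau \approx 1/B$), and the function names are hypothetical. The RSG update itself is left as a placeholder comment.

```python
import random


def sample_disjoint_sub_das(d_v, num_sub_das, rng):
    """Randomly partition visible-unit indices 0..d_v-1 into num_sub_das
    disjoint groups (one group of visible units per sub-DA)."""
    if not 1 <= num_sub_das <= d_v:
        raise ValueError("need 1 <= num_sub_das <= d_v")
    order = list(range(d_v))
    rng.shuffle(order)
    # Deal the shuffled indices round-robin so group sizes differ by at most 1.
    return [sorted(order[b::num_sub_das]) for b in range(num_sub_das)]


def dda_schedule(d_v, num_sub_das, num_meta_iters, seed=0):
    """One fresh random partition per meta-iteration."""
    rng = random.Random(seed)
    return [sample_disjoint_sub_das(d_v, num_sub_das, rng)
            for _ in range(num_meta_iters)]


schedule = dda_schedule(d_v=10, num_sub_das=3, num_meta_iters=2)
for m, groups in enumerate(schedule):
    for b, units in enumerate(groups):
        # Placeholder: an RSG on sub-DA b would update only the columns of the
        # shared W indexed by `units`, starting from the latest shared estimate.
        print(f"meta-iteration {m}, sub-DA {b}: visible units {units}")
```

Every visible unit appears in exactly one group per meta-iteration, so each column of the shared $W$ is touched by exactly one RSG per pass.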


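Corollary 4.2 (Convergence of one-layer dDA). The optimal constant step size $\gamma_b^k$ is given by

$$\gamma_b^k = \gamma = \frac{D}{\sqrt{N} (\tau d_h d_v)^{3/4}} \quad \forall\, b, k; \qquad 0 < D \le \frac{(\tau d_h d_v)^{1/4}}{L'} \tag{44}$$

By selecting $B$ according to Lemma 4.1 and denoting $\bar{D} = \frac{D_f}{B D} + D L^2 L'$, we have

$$\mathbb{E}\left( \|\nabla_W f(W_b^R)\|^2 \right) \le \bar{D} \, \frac{(\tau d_h d_v)^{3/4}}{\sqrt{N}} \tag{45}$$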
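
To get a feel for how the step size and the bound scale with $\tau$, $B$ and $N$, both expressions can be evaluated directly. The sketch below simply encodes the constant step size $\gamma = D / (\sqrt{N} (\tau d_h d_v)^{3/4})$ and the right-hand side $\bar{D} (\tau d_h d_v)^{3/4} / \sqrt{N}$ with $\bar{D} = D_f/(BD) + D L^2 L'$; the function and argument names are hypothetical, and the constants $D_f$, $L$ and $L'$ are treated as given inputs (in practice they are rarely known exactly).

```python
import math


def dda_step_size(big_d, n_calls, tau, d_h, d_v):
    """Constant step size gamma = D / (sqrt(N) * (tau * d_h * d_v)**(3/4))."""
    return big_d / (math.sqrt(n_calls) * (tau * d_h * d_v) ** 0.75)


def dda_gradient_bound(d_f, big_d, lip_l, lip_lp, n_sub_das, n_calls, tau, d_h, d_v):
    """Bound D_bar * (tau * d_h * d_v)**(3/4) / sqrt(N),
    with D_bar = D_f / (B * D) + D * L**2 * L'."""
    d_bar = d_f / (n_sub_das * big_d) + big_d * lip_l ** 2 * lip_lp
    return d_bar * (tau * d_h * d_v) ** 0.75 / math.sqrt(n_calls)


full = dda_gradient_bound(2.0, 0.5, 1.0, 1.0, 1, 100, 1.0, 10, 20)
half = dda_gradient_bound(2.0, 0.5, 1.0, 1.0, 1, 100, 0.5, 10, 20)
print(half / full)  # ratio of the bounds when only tau changes
```

Holding everything else fixed, halving $\tau$ scales the bound by $0.5^{3/4}$; increasing $B$ shrinks the $D_f/(BD)$ part of $\bar{D}$ further.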


Proof. The proof of this result emulates the proofs of Lemma 3.2 and Corollary 3.4. First we derive an upper
bound on the expected gradients similar to the one in Lemma 3.2. Using this bound, we then compute the
optimal stepsizes and the rate of convergence.

In the distributed setting, we have $B$ RSGs running synchronously (or sequentially) and the size of each
of the $B$ sub-DAs is $(\lceil \tau d_v \rceil, d_h)$. This is the same as a $(d_v, d_h)$ DA with $\tau d_h d_v$ unknowns (for notational convenience
the ceiling operator $\lceil \cdot \rceil$ is dropped in the analysis). So, the bounds on the variance of the noisy gradients $G(\eta^k; W^k)$ and
the Lipschitz constant of $\nabla_W f(W)$ change as follows.

$$\mathrm{Var}(G(\eta^k; W^k)) \le \tau d_h d_v L^2, \qquad \|\nabla_W f(W) - \nabla_W f(\hat{W})\| \le \sqrt{\tau d_h d_v} \, L' \|W - \hat{W}\| \tag{46}$$


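We then have the following inequality for each of the $B$ RSGs, based on the analysis in the proof of Lemma 3.2:

$$\sum_{k=1}^{N_b} \left( \gamma_b^k - \frac{\sqrt{\tau d_h d_v} L'}{2} (\gamma_b^k)^2 \right) \|\nabla_W f(W_b^k)\|^2 \le f(W_b^1) - f(W_b^{N_b+1}) - \sum_{k=1}^{N_b} \left( \gamma_b^k - \sqrt{\tau d_h d_v} L' (\gamma_b^k)^2 \right) \langle \nabla_W f(W_b^k), \delta_b^k \rangle + \frac{\sqrt{\tau d_h d_v} L'}{2} \sum_{k=1}^{N_b} (\gamma_b^k)^2 \|\delta_b^k\|^2 \tag{47}$$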

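It should be noted that the subscript $b$ indicates the $b$-th RSG, i.e., $W_b^k$ is the $k$-th update from the $b$-th RSG, and
$N_b$ denotes the maximum number of iterations of the $b$-th RSG. Although the size of $W$ is $d_h \times d_v$, only $\tau d_h d_v$ of
the total $d_h d_v$ parameters are updated within a single RSG.

Now recall that the sequential nature of the $B$ RSGs implies that the estimate of $W$ at the end of the $b$-th RSG will be
the starting point for the $(b+1)$-th RSG. This implies that $f(W_b^{N_b+1}) = f(W_{b+1}^1)$ for all $b = 1, \ldots, B-1$. Using this
fact, we can then sum up all the $B$ inequalities of the form in (47) to get

$$\sum_{b=1}^{B} \sum_{k=1}^{N_b} \left( \gamma_b^k - \frac{\sqrt{\tau d_h d_v} L'}{2} (\gamma_b^k)^2 \right) \|\nabla_W f(W_b^k)\|^2 \le f(W_1^1) - f(W_B^{N_B+1}) - \sum_{b=1}^{B} \sum_{k=1}^{N_b} \left( \gamma_b^k - \sqrt{\tau d_h d_v} L' (\gamma_b^k)^2 \right) \langle \nabla_W f(W_b^k), \delta_b^k \rangle + \frac{\sqrt{\tau d_h d_v} L'}{2} \sum_{b=1}^{B} \sum_{k=1}^{N_b} (\gamma_b^k)^2 \|\delta_b^k\|^2 \tag{48}$$

Using the fact that $f^* \le f(W_B^{N_B+1})$, we then have

$$\sum_{b=1}^{B} \sum_{k=1}^{N_b} \left( \gamma_b^k - \frac{\sqrt{\tau d_h d_v} L'}{2} (\gamma_b^k)^2 \right) \|\nabla_W f(W_b^k)\|^2 \le f(W_1^1) - f^* - \sum_{b=1}^{B} \sum_{k=1}^{N_b} \left( \gamma_b^k - \sqrt{\tau d_h d_v} L' (\gamma_b^k)^2 \right) \langle \nabla_W f(W_b^k), \delta_b^k \rangle + \frac{\sqrt{\tau d_h d_v} L'}{2} \sum_{b=1}^{B} \sum_{k=1}^{N_b} (\gamma_b^k)^2 \|\delta_b^k\|^2 \tag{49}$$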

We now take the expectation of the above inequality over all the random variables involved in the $B$ RSGs, which
include the $B$ stopping iterations $R_b$, $b = 1, \ldots, B$, and the random processes $\eta_b^{[N_b]}$, $b = 1, \ldots, B$ (where $\eta_b^{[N_b]}$ is
the random process of $\eta$ within the $b$-th RSG). First, note the following observations about $\langle \nabla_W f(W_b^k), \delta_b^k \rangle$ and $\|\delta_b^k\|^2$:

$$\mathbb{E}_{\eta_b^k}\left( G(\eta_b^k; W_b^k) \right) = \nabla_W f(W_b^k) \implies \mathbb{E}_{\eta_b^k} \langle \nabla_W f(W_b^k), \delta_b^k \rangle = 0, \qquad \mathbb{E}_{\eta_b^k} \|\delta_b^k\|^2 = \mathrm{Var}(G(\eta_b^k; W_b^k)) \le \tau d_h d_v L^2 \tag{50}$$

which follow from (46). This implies that after taking the expectation of the inequality in (49), the last two terms on
the right hand side will be


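$$\mathbb{E}_{\eta^{[N]}} \left[ \sum_{b=1}^{B} \sum_{k=1}^{N_b} \left( \gamma_b^k - \sqrt{\tau d_h d_v} L' (\gamma_b^k)^2 \right) \langle \nabla_W f(W_b^k), \delta_b^k \rangle \right] = \sum_{b=1}^{B} \sum_{k=1}^{N_b} \left( \gamma_b^k - \sqrt{\tau d_h d_v} L' (\gamma_b^k)^2 \right) \mathbb{E}_{\eta_b^{[k]}} \left( \langle \nabla_W f(W_b^k), \delta_b^k \rangle \mid \eta_b^1, \ldots, \eta_b^{k-1} \right) = 0 \tag{51}$$

$$\frac{\sqrt{\tau d_h d_v} L'}{2} \, \mathbb{E}_{\eta^{[N]}} \left[ \sum_{b=1}^{B} \sum_{k=1}^{N_b} (\gamma_b^k)^2 \|\delta_b^k\|^2 \right] = \frac{\sqrt{\tau d_h d_v} L'}{2} \sum_{b=1}^{B} \sum_{k=1}^{N_b} (\gamma_b^k)^2 \, \mathbb{E}_{\eta_b^{[k]}} \|\delta_b^k\|^2 \le \frac{(\sqrt{\tau d_h d_v})^3 L' L^2}{2} \sum_{b=1}^{B} \sum_{k=1}^{N_b} (\gamma_b^k)^2 \tag{52}$$

where $\eta^{[N]}$ denotes the composition of the $B$ random processes $\eta_b^{[N_b]}$, $b = 1, \ldots, B$. Using (51), (52) and (49), we
get

$$\sum_{b=1}^{B} \sum_{k=1}^{N_b} \left( 2\gamma_b^k - \sqrt{\tau d_h d_v} L' (\gamma_b^k)^2 \right) \mathbb{E}_{\eta^{[N]}} \|\nabla_W f(W_b^k)\|^2 \le 2 \left( f(W_1^1) - f^* \right) + (\sqrt{\tau d_h d_v})^3 L' L^2 \sum_{b=1}^{B} \sum_{k=1}^{N_b} (\gamma_b^k)^2 \tag{53}$$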

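Recall the definition of $P_R(k)$ from Lemma 3.2, which is

$$P_R(k) = \Pr(R = k) = \frac{2\gamma^k - L' \sqrt{d_h d_v} (\gamma^k)^2}{\sum_{k=1}^{N} \left( 2\gamma^k - L' \sqrt{d_h d_v} (\gamma^k)^2 \right)}, \qquad k = 1, \ldots, N \tag{54}$$

Adapting this to the current case of $B$ sequential RSGs, we get

$$P_{R_b}(k) = \Pr(R_b = k) := \frac{2\gamma_b^k - L' \sqrt{\tau d_h d_v} (\gamma_b^k)^2}{\sum_{b=1}^{B} \sum_{k=1}^{N_b} \left( 2\gamma_b^k - L' \sqrt{\tau d_h d_v} (\gamma_b^k)^2 \right)}, \qquad k = 1, \ldots, N, \quad b = 1, \ldots, B \tag{55}$$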
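
For intuition, the randomized stopping rule can be sketched for a single RSG: the stopping iteration is drawn with probability proportional to $2\gamma^k - c\,(\gamma^k)^2$, where $c = L'\sqrt{\tau d_h d_v}$ in the distributed case. The sketch below is an assumption-laden illustration (hypothetical function names, a single RSG normalized on its own, and $L'$ supplied by the caller); it does not address how the normalization is shared across the $B$ RSGs.

```python
import random


def stopping_pmf(step_sizes, lip_lp, tau, d_h, d_v):
    """Probabilities proportional to 2*gamma_k - L'*sqrt(tau*d_h*d_v)*gamma_k**2."""
    c = lip_lp * (tau * d_h * d_v) ** 0.5
    weights = [2.0 * g - c * g * g for g in step_sizes]
    if min(weights) <= 0.0:
        raise ValueError("step sizes too large: need 2*gamma > c*gamma**2")
    total = sum(weights)
    return [w / total for w in weights]


def sample_stopping_iteration(pmf, rng):
    """Draw R in {1, ..., N} according to pmf."""
    return rng.choices(range(1, len(pmf) + 1), weights=pmf, k=1)[0]


pmf = stopping_pmf([0.01] * 5, lip_lp=1.0, tau=0.5, d_h=4, d_v=8)
r = sample_stopping_iteration(pmf, random.Random(0))
print(pmf, r)
```

With a constant step size the weights are all equal, so the stopping iteration is uniform over $\{1, \ldots, N\}$; with decaying step sizes, earlier iterations are more likely to be returned.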


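Using this distribution of the stopping iterations and taking the expectation of (53) with respect to the set of random
variables $R_b$, $b = 1, \ldots, B$, we get

$$\mathbb{E}\left( \|\nabla_W f(W_b^R)\|^2 \right) = \frac{\sum_{b=1}^{B} \sum_{k=1}^{N_b} \left( 2\gamma_b^k - L' \sqrt{\tau d_h d_v} (\gamma_b^k)^2 \right) \mathbb{E}_{\eta^{[N]}} \|\nabla_W f(W_b^k)\|^2}{\sum_{b=1}^{B} \sum_{k=1}^{N_b} \left( 2\gamma_b^k - L' \sqrt{\tau d_h d_v} (\gamma_b^k)^2 \right)} \le \frac{D_f + (\sqrt{\tau d_h d_v})^3 L^2 L' \sum_{b=1}^{B} \sum_{k=1}^{N_b} (\gamma_b^k)^2}{\sum_{b=1}^{B} \sum_{k=1}^{N_b} \left( 2\gamma_b^k - L' \sqrt{\tau d_h d_v} (\gamma_b^k)^2 \right)} \tag{56}$$

Observe that whenever $B$ is selected as in Lemma 4.1 with sufficiently small $\phi$, each of the $d_h d_v$ unknowns is updated
in at least one of the $B$ RSGs. Hence, all the unknowns are covered in the left hand side above, see (56).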

We now compute the optimal stepsizes and the corresponding convergence rate using the upper bound in (56). At
any given point of time only one of the $B$ RSGs will be running. So, using Corollary 3.4, the optimal constant stepsize
for the $b$-th RSG is then given by

$$\gamma_b^k = \gamma_b = \frac{D}{\sqrt{N_b} (\tau d_h d_v)^{3/4}}, \qquad \text{where } D \le \frac{(\tau d_h d_v)^{1/4}}{L'}$$

Assuming $N_b = N$ for all $b = 1, \ldots, B$, we then have

$$\gamma_b = \gamma = \frac{D}{\sqrt{N} (\tau d_h d_v)^{3/4}} \tag{57}$$


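With this in hand, we now derive the convergence rate. Using the constant stepsizes $\gamma_b^k = \gamma$ for all $k, b$ and the
assumption that $N_b = N$ for all $b$, the upper bound in (56) becomes

$$\mathbb{E}\left( \|\nabla_W f(W_b^R)\|^2 \right) \le \frac{D_f + (\sqrt{\tau d_h d_v})^3 L^2 L' N B \gamma^2}{N B \left( 2\gamma - L' \sqrt{\tau d_h d_v} \, \gamma^2 \right)} \le \frac{D_f + (\sqrt{\tau d_h d_v})^3 L^2 L' N B \gamma^2}{N B \gamma} \tag{58}$$

where the last inequality uses the fact that $(2 - L' \sqrt{\tau d_h d_v} \, \gamma) \ge 1$ (which follows from Lemma 3.3 and was used in
deriving the stepsizes in Corollary 3.4). Substituting for $\gamma$ from (57) in the above inequality gives

$$\mathbb{E}\left( \|\nabla_W f(W_b^R)\|^2 \right) \le \frac{D_f}{N B \gamma} + (\tau d_h d_v)^{3/2} L^2 L' \gamma = \frac{D_f (\tau d_h d_v)^{3/4}}{B D \sqrt{N}} + \frac{L^2 L' (\tau d_h d_v)^{3/4} D}{\sqrt{N}} \tag{59}$$

By denoting $\bar{D} = \frac{D_f}{B D} + D L^2 L'$, we finally have

$$\mathbb{E}\left( \|\nabla_W f(W_b^R)\|^2 \right) \le \bar{D} \, \frac{(\tau d_h d_v)^{3/4}}{\sqrt{N}} \tag{60}$$

□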


Corollary 4.3 (Sample size estimates of one-layer dDA). The number of meta-iterations ($M$) and the number of
data instances ($S$) required to compute an $(\epsilon, \delta)$-solution in the distributed setting are

$$M(\tau, \delta) \ge \frac{\log(1/\delta)}{\log(\sqrt{r})}, \qquad S(\tau, \epsilon) \ge \frac{r (\tau d_h d_v)^{3/2}}{t \epsilon^2} \tag{61}$$

where $r > 1$ is a given constant and $t$ denotes the average number of times each data instance is used within each
sub-DA.

Proof. First observe that there is no randomization of data instances across the $B$ sub-DAs. Hence we can compute
the sample size $S$ from a single sub-DA. Secondly, since $B \ge 1$, $\bar{D}$ in Corollary 4.2 is such that $\bar{D} \le \frac{D_f}{D} + D L^2 L'$.
Using these two facts, the computation for $S$ then follows the steps in Corollary 3.5 with $d_h d_v$ replaced by $\tau d_h d_v$. Hence
we have

$$S(\tau, \epsilon) \ge \frac{r (\tau d_h d_v)^{3/2}}{t \epsilon^2} \tag{62}$$

To compute the bound for $M$ we follow the same steps as in the proof of Corollary 3.5 and end up with the following
inequality

$$\Pr\left( \|\nabla_W f(W^R)\|^2 \ge \epsilon \bar{D} \right) \le \frac{(\tau d_h d_v)^{3/4}}{\epsilon \sqrt{N}} \tag{63}$$

Since $N$ is the number of SFO calls for each of the $B$ RSGs, we have $N = \frac{r (\tau d_h d_v)^{3/2}}{\epsilon^2}$. Then we have

$$\Pr\left( \min_{c = 1, \ldots, M} \|\nabla_W f(W^{R_c})\|^2 \ge \epsilon \bar{D} \right) \le \prod_{c=1}^{M} \frac{1}{\sqrt{r}} = r^{-M/2} \tag{64}$$

and hence $M(\tau, \delta) \ge \frac{\log(1/\delta)}{\log(\sqrt{r})}$.

□
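
The sample size estimates can be turned into a small calculator. The sketch below is illustrative only: it evaluates $N = r(\tau d_h d_v)^{3/2}/\epsilon^2$, $S = N/t$ and $M = \lceil \log(1/\delta) / \log\sqrt{r} \rceil$ as lower-bound estimates, with hypothetical function and argument names, and it ignores the constants absorbed into $\bar{D}$.

```python
import math


def dda_sample_sizes(tau, d_h, d_v, eps, delta, r, t):
    """Return (N, S, M): SFO calls per RSG, data instances, meta-iterations."""
    if not (0.0 < tau <= 1.0 and eps > 0.0 and 0.0 < delta < 1.0 and r > 1.0 and t > 0.0):
        raise ValueError("need 0 < tau <= 1, eps > 0, 0 < delta < 1, r > 1, t > 0")
    n_calls = r * (tau * d_h * d_v) ** 1.5 / eps ** 2    # N = r (tau d_h d_v)^{3/2} / eps^2
    n_samples = n_calls / t                               # S = N / t
    n_meta = math.ceil(math.log(1.0 / delta) / math.log(math.sqrt(r)))
    return n_calls, n_samples, n_meta


n_full, s_full, m_full = dda_sample_sizes(1.0, 4, 4, 1.0, 0.01, 4.0, 2.0)
n_half, s_half, m_half = dda_sample_sizes(0.5, 4, 4, 1.0, 0.01, 4.0, 2.0)
print(s_full, s_half / s_full, m_full)
```

The number of meta-iterations depends only on $r$ and $\delta$, whereas the required number of instances shrinks by a factor of $\tau^{3/2}$ relative to the non-distributed case ($\tau = 1$).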


These results show that, whenever $B$ is chosen as in Lemma 4.1, the convergence rate and sample sizes will
improve by $\tau^{3/4}$ and $\tau^{3/2}$ respectively, if the stepsize is appropriate. The improvements may be much larger whenever
$\zeta$ is not unreasonably small (or $\tau$ is not too close to 1).


5 Experiments 

To evaluate the bounds presented above, we pre-trained a one-layer DA on two computer vision datasets and one
neuroimaging dataset: MNIST digits, Magnetic Resonance Images from the Alzheimer’s Disease Neuroimaging Initiative
(ADNI), and ImageNet. These will be referred to as mnist, neuro and imagenet. See the supplement for complete details
about these datasets, including the number of instances, features and other attributes. Briefly, the neuro dataset has
stronger correlations across its dimensions compared to the others, and imagenet includes natural images and is very
diverse/versatile.

Our experiments are two-fold. We first evaluate the non-distributed setting (Corollary 3.4, (24)) by computing
the expected gradients vs. the number of SFO calls ($N$) and the network structure ($d_v$, $d_h$). We then evaluate the
distributed setup (Corollary 4.2, (45)) by varying the number of disjoint sub-DAs ($B$) that constitute the network. The
expectations in (24) and (45) are approximated by the empirical average of the gradient norm (last 100 iterations). Since
we are interested in the trends of convergence rates, all plots are normalized/scaled by the corresponding maximum
value of expected gradients. Figure 1 shows these results: the first and second columns correspond to the
non-distributed setting and the last column corresponds to the distributed setting. Each row represents one of the three
datasets considered.
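
The two post-processing steps described above (averaging the gradient norm over the last 100 iterations, then scaling each curve by its maximum) amount to the following. This is a minimal sketch with hypothetical function names; it assumes the per-iteration gradient norms have already been recorded in a list.

```python
def expected_gradient(grad_norms, window=100):
    """Empirical average of the gradient norm over the last `window` iterations."""
    tail = grad_norms[-window:]
    return sum(tail) / len(tail)


def normalize_curve(values):
    """Scale a curve by its maximum value so that its peak equals 1."""
    peak = max(values)
    return [v / peak for v in values]


# Toy curve: one averaged value per setting of N (illustrative numbers only).
curve = normalize_curve([2.0, 1.0, 0.5])
print(curve)
```

Normalizing this way preserves the shape of each curve, which is what matters when comparing decay trends across datasets with very different gradient magnitudes.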


Expected gradients vs N. Figure 1(a,d,g) show that the expected gradients decrease as the number of SFO calls ($N$)
increases. The three curves (red, black, blue) in each plot correspond to different stepsizes. The expected gradients
decrease monotonically for all the curves in Figure 1(a,d,g), and their hyperbolic trend as $N$ increases supports the
decay rate presented in (24). Unlike mnist and imagenet, neuro has stronger correlations across its features,
and so shows a decay rate that seems to be stronger than $1/\sqrt{N}$ (the red curve in Figure 1(d)). The gradients, in general,
also seem to be smaller for larger stepsizes (blue and black curves), which is expected because the local minima are
attained faster with reasonably large stepsizes, until the minima are overshot. The supplement shows a plot indicative of
this well-known behavior.


Expected gradients vs $d_v$, $d_h$. The second column (Figure 1(b,e,h)) shows the influence of increasing the length of
the visible layer ($d_v$) for multiple $d_h$ and fixed $N$. As suggested by (24), the expected gradients increase as $d_v$
increases. This rate of increase (vs. increasing $d_v$ on the x-axis) seems to be stronger for smaller values of $d_h$ (black and
green curves vs. red and blue curves). Recall that $d_h$ should be “sufficiently” large to encode the underlying input
data dependencies (Paugam-Moisy (1997); Lawrence et al. (1998); Bianchini & Scarselli (2014)). Hence the network
may under-fit for small $d_h$, and not recover inputs with small error. This behavior is seen in Figure 1(b,e,h) where
initially the expected gradients (across all values of $d_v$) gradually decrease as $d_h$ increases (black, green and red curves).
Once $d_h$ is reasonably large, increasing it further tends to increase the expected gradients (as shown by the blue curve
which overlaps the others). $d_h$ may hence be chosen empirically (e.g. using cross validation), so that the network still
generalizes to test instances but is not massive (avoids unnecessary computational burden).

Figure 1: Expected gradients. First (a,d,g) column shows the expected gradients vs the number of SFO calls $N$, for multiple stepsizes $\gamma$
(corresponding to red, black and blue colors). Second (b,e,h) column shows the expected gradients vs. the size of visible layer $d_v$ for multiple
values of $d_h$ (corresponding to red, blue and black colors). Third column (c,f,i) presents expected gradients vs. the number of sub-DAs ($B > 1$) used
in a distributed asynchronous setting (for a fixed number of iterations $N$ and network size $d_h d_v$). For the results in the first and last columns, $d_v$
equals the inherent input data dimensionality (see supplement), and $d_h$ is one-tenth of $d_v$. Top row corresponds to mnist, second to neuro and
third to imagenet. All the expected gradients are normalized with the maximum value in the respective plot.

Figure 2: The relative time speed-up achieved by distributed pre-training (vs. non-distributed) as a function of the number of cores (x-axis),
which in this experiment is equal to the number $B$ of sub-DAs (see Lemma 4.1), i.e. each core works on one sub-DA. Curves correspond to
different numbers of parameters. Step-sizes are scaled according to (17) while $N$ is fixed for each curve.


Does distributed learning help? The last column in Figure 1 shows the expected gradients in a distributed setting
where the x-axis represents the number of sub-DAs ($B$) into which the whole network is divided. The value of $B$ is
chosen such that $d_h$ is no larger than twice the size of $d_v$. Corollary 4.2 presents the bounds with respect to $\tau$, which
is the fraction of the visible layer used in each of the sub-DAs. The results in Figure 1(c,f,i) are shown relative to the
number of disjoint sub-DAs $B$, which is chosen to be at least $1/\tau$ and follows the conditions in Lemma 4.1. Observe
that the expected gradients decay as $B$ increases for all three datasets considered. For a sufficiently large $B$, the
decay rate settles down with no further improvement, see Figure 1(f,i). The bounds derived in Section 4 are based on a
synchronous setup. In our experiments a central master holds the current updates of the parameters, and the $B$ different
sub-DAs pre-train independently on as many as 200 cores, communicating with the master via message passing. The
sub-DAs are initialized by running the whole network (in a non-distributed way) for a few hundred iterations.

Figure 2 shows the time speed-up achieved by distributing the pre-training (relative to the non-distributed setting)
on neuro and imagenet. Note that the number of sub-DAs used is equal to the number of cores used, which means one
sub-DA is pre-trained per core. As the number of cores increases, the speed-up relative to the non-distributed
setting increases rapidly up to a certain limit, and then gradually falls back. This is because for large values of $B$
the communication time between machines dominates the actual computation time. The speed-up is much higher
for datasets with a large number of parameters (> 50mil, red and black curves vs. 15mil, blue curve). Note that the
distributed setting gives faster convergence and time speed-up, but does not lose out on generalization error (refer to
the supplement for a plot confirming this behavior). Lastly, these computational (Figure 1(c,f,i)) and time speed-up
results, in tandem with existing observations (Bengio (2009); Erhan et al. (2010); Vincent et al. (2010); Dean et al. (2012)),
provide strong empirical support to the convergence and sample size bounds constructed in Sections 3 and 4.


6 Conclusion 

We analyzed the convergence rate and sample size estimates of gradient based learning of deep architectures. The
only assumption we make is on the Lipschitz continuity of the loss function. We provided bounds for classical and
distributed pre-training for Denoising Autoencoders, and the experiments support the suggested behavior. We believe
that our results complement a sizable body of work showing the success of empirical pre-training in deep architectures
and identify a number of interesting directions for additional improvements, both on the theoretical side and in
the design of practical large scale pre-training.


Appendix (Supplementary Material) 

Datasets Description. The three datasets that were used, mnist, neuro and imagenet, correspond to the small-d-large-n,
large-d-small-n and large-d-large-n setups respectively ($d$ is the number of data dimensions, $n$ is the number of data
instances).


• mnist: This famous digit recognition dataset contains binary images of hand-written digits (0-9). We used
10^ of these images which are part of the mnist training data set (http://yann.lecun.com/exdb/mnist/). The
training data contains an approximately equal number of instances for each of the ten classes. Each image has 784
pixels/dimensions, and the signal in each pixel is binary. No extra preprocessing was done to the data.

• neuro: This neuroimaging dataset is a prototypical example of a dataset with a very large number of features,
but a small number of instances. It comprises Magnetic Resonance Imaging (MRI) data from the Alzheimer’s
Disease Neuroimaging Initiative study from a total of 534 subjects. Each image is three-dimensional, of size
256 × 256 × 176. Each voxel in this 3D space corresponds to water-level intensity in the brain, and the signal
is a positive scalar. Standard pre-processing is applied to all the images, which involves stripping out grey
matter and normalizing to a template space (called MNI space). Refer to the Statistical Parametric Mapping Tool
(SPM8, http://www.fil.ion.ucl.ac.uk/spm/doc/) for this standardised procedure. The resulting processed images
are sorted according to the signal variance. For the experiments in this work, we picked out the top (most
variant) 25% of the features/voxels, which amounted to 3 x 10"^ features. Even within this setting the number
of features is much larger than the number of instances available (534).

• imagenet: This well-known dataset comprises natural images from various types of categories collected as
a part of the WordNet hierarchy. It comprises more than 14 million images, broadly categorized under more
than 20 thousand synsets (http://www.image-net.org/). We used imaging data from five of the largest categories
contained in the imagenet database. This amounts to > 7000 synsets/sub-categories and approximately 5 million
images. As a pre-processing step, we resized all images to 128 x 128 pixels, and centered each of the 16384
dimensions.




Figure 3: (a) Distributed setup does not lose on generalization error. The four curves correspond to the ratio of test-set reconstruction errors for
distributed pre-training ($B > 2$) to the non-distributed case. The error-bars correspond to 10 fold cv errors computed using 10 different test-sets.
(b) Expected gradients vs the number of SFO calls $N$, for multiple stepsizes $\gamma$ (corresponding to the four different colors). The trends show that
as stepsize increases the expected gradients decrease, and beyond a reasonably large stepsize (green curve) the gradients overshoot local optima (blue
curve).